Introduction

The purpose of this project is to take a player’s individual statistics from a game and attempt to determine whether they won or lost that game. I will then use this predictive capability to try to guess the winner of the largest tournament of the year, Call of Duty Champs.


What is the Call of Duty World League?

Call of Duty is a first-person shooter franchise that began in 2003. Since then, it has become one of the largest multiplayer video game franchises in existence, and a competitive scene for the game has gained traction. In 2016, the Call of Duty World League was born – a sponsored league that hosts major tournaments throughout the year for the best players in the world. In these events, the pros play three different game modes to decide the winner of a series. These game modes are Hardpoint, Search and Destroy, and a third game mode that often changes yearly; for the data covered here, the third game mode is Control. Every team in the league consists of 5 players, and each series is a best of 5.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s")

Game Modes in Circuit

Hardpoint

In Hardpoint, the two teams fight over a zone on the map called the “Hardpoint”; for every second a team spends holding it, that team gains one point. If both teams are in the Hardpoint at the same time, neither team collects points. Every sixty seconds, the Hardpoint changes location on the map, so teams must make tactical decisions about how to rotate across the map. The first team to 250 points wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(6*60 + 35)
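As a toy illustration of the scoring rule above (a hypothetical sketch, not anything from the CWL data), the per-second accumulation can be written as a tiny function:

```r
# Toy sketch of Hardpoint scoring: one point per second of uncontested
# hold, nothing while both teams are on the hill. Hypothetical example.
score_hardpoint <- function(holder) {
  # holder: one entry per second -- "A", "B", or "contested"
  c(A = sum(holder == "A"), B = sum(holder == "B"))
}

score_hardpoint(c("A", "A", "contested", "B", "A", "A"))  # A: 4, B: 1
```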

Search and Destroy

In Search and Destroy, the two teams play rounds in which each player has only one life; if you die, you are out until the next round. The objective is to eliminate the entire enemy team before the time limit, or, if you are on offense, to plant the bomb; if the bomb detonates after its 45-second timer, the offense also wins the round. The first team to win 6 rounds wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(18*60 + 39)

Control

In Control, one team attacks and one defends, switching sides between rounds, and each team has 30 lives per round. The offensive team tries to either capture two points on the map or eliminate all 30 of the defenders’ lives; the defensive team tries to either hold the two points until time runs out or eliminate all 30 of the attackers’ lives. The first team to win three rounds wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(45*60 + 40)

Why is this model useful?

This model is useful because it will allow us to see whether a player’s statistics contributed to a win or not. As a fan of competitive COD, I know there is a lot of debate about statistics and their importance, so I wanted to look directly at the impact of a player’s statistics.


Load Packages

All the packages are loaded below.

library(readr)
library(tidyverse)
library(tidymodels)
library(janitor)
library(sqldf)
library(sjmisc)
library(vembedr)
library(rpart)
library(rpart.plot)
library(parsnip)
library(vip)
library(randomForest)
library(dplyr)
library(tidyr)
library(data.table)
library(datasets)
library(caret)
library(ROCR)
library(tune)
library(kknn)
library(RColorBrewer)

Load Data

This project makes use of official CWL data that is uploaded on GitHub. All data is organized relatively cleanly, and all missing data is flagged.

proleague2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-05-proleague.csv"))
fortworth2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-03-17-fortworth.csv"))
london2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-05-05-london.csv"))
anaheim2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-06-16-anaheim.csv"))
proleagueFinals2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-21-proleague-finals.csv"))

# all stats for all major tournaments (EXCEPT CHAMPS) in BO4 (2019)
majors2019 <- rbind(proleague2019, fortworth2019, london2019, anaheim2019, proleagueFinals2019)

# champs will act as our test data; we will try and predict the winner
champs2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-08-18-champs.csv"))

What data are we concerned with?

In order to determine a win for a game, we will need to address Hardpoint, Search and Destroy, and Control separately. Each of these game modes has different parameters, so we will have to fit models for each game mode.

Hardpoint:

Search and Destroy:

Control:


Data Split

I will be splitting my data just before I work on my models. I found that this made the most sense, as I will have to split multiple times for the different game modes. This way, it will be easier to stay organized.


Data Cleaning and Organization

The data below is for all of the majors throughout the season, except for COD Champs. I will use the Champs data later as a way to further test my model. The raw data from each major is merged into one major dataset, which is then broken up into Hardpoint, SND, and Control datasets.

All Majors 2019 data

Below is the code for the merging of the data from all Majors during the season. In order to make sure that the data was ready for use, I went through a few cleaning processes:

  1. cleaned names
  2. removed games with data loss
  3. removed all players with 50 or fewer games played, as they would skew the data
  4. manually changed Win/Loss into a dummy variable
  5. manually added a column that determined the role of each player
# CLEANING
majors2019 <- majors2019 %>% clean_names(.)

# new dataset that contains all of the missing data, just in case
majors2019_missing <- sqldf('SELECT * FROM majors2019 WHERE match_id LIKE "missing%"')

# whole event data, all players and all maps, where player names are organized alphabetically
majors2019 <- majors2019[order(majors2019$player),]

# removes missing values
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE match_id NOT LIKE "missing%"')

# calculates all the players that have played more than 50 games
player_numgames <- count(majors2019, player) %>% subset(., n > 50) %>% remove_cols(n)

# includes all existing data for all players that have played more than 50 games (arbitrary number)
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE player IN player_numgames')

# removes all matches where damage = 0; almost always occurs as a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')

# changes W to 1, L to 0
majors2019$win <- ifelse(majors2019$win == "W", 1, 0) %>%
  as.factor()

# assigning a role to each player to allow for more precise comparisons
playerRoles <- majors2019 %>%
  group_by(player) %>%
  count(player, fave_weapon) %>%
  top_n(1, n) %>%
  mutate(role = fave_weapon) %>%
  subset(select = -c(fave_weapon, n))

# A player's role is defined as a sub (1), flex (2), or an ar (3).

# replace fav gun with corresponding role
playerRoles$role <- str_replace(playerRoles$role, "Saug 9mm", "1")
playerRoles$role <- str_replace(playerRoles$role, "Maddox RFB", "2")
playerRoles$role <- str_replace(playerRoles$role, "ICR-7", "3")

# making factors
playerRoles$role <- factor(playerRoles$role)

# manual adjustment for player TJHaly
playerRoles <- playerRoles[-c(83), ]

# joining two sets together
majors2019 <- dplyr::inner_join(playerRoles, majors2019, by = "player")

Now that this Majors dataset was created, I was able to easily subset it for each game mode.

Hardpoint subset

Below is a subset of all the Hardpoint data for the season. I used the “sqldf” package as a way to refresh my SQL skills, as well as to remove any data that I deemed unnecessary. This will make the data easier to manage and organize later.

# all 2019 hardpoint data
hp2019 <- sqldf('SELECT player, k_d, role, win, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Hardpoint"')
hp2019 <- hp2019[order(hp2019$player),]

Search and Destroy subset

Below is a subset of all the Search and Destroy data for the season. This will make it easier to manage and organize later. I also add two new columns: fb_round_ratio and fb_avg.

Note: Ultimately, I ended up not using fb_avg in my models, but I left it as an interesting player-driven statistic.

# all 2019 SND data
snd2019 <- sqldf('SELECT match_id, team, player, role, win, kills, deaths, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Search & Destroy"')

# adds new column with fb/round ratio
snd2019 <- add_column(snd2019, fb_round_ratio = snd2019$snd_firstbloods/snd2019$snd_rounds)

# adding a new column with average first bloods for the season
snd2019 <- snd2019 %>%
  group_by(player) %>%
  mutate(fb_avg = mean(snd_firstbloods))

# puts data in alphabetical order
snd2019 <- snd2019[order(snd2019$player),]

Control subset

Below is a subset of all the Control data for the season. This will make it easier to manage and organize in the future.

# all 2019 CONTROL data
control2019 <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm, x2_piece, x3_piece, x4_piece, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures FROM majors2019 WHERE mode == "Control"')
control2019 <- control2019[order(control2019$player),]

Champs 2019 dataset

Similar to the Majors dataset, I also needed to tidy up the Champs data. I followed these steps:

  1. cleaned names
  2. removed missing data
  3. manually changed Win/Loss into dummy variable
  4. added roles for each player
champs2019 <- champs2019 %>% clean_names(.)
champs2019 <- champs2019[order(champs2019$player),]
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE match_id NOT LIKE "missing%"')
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE damage_dealt != "0"')

# changes W to 1, L to 0
champs2019$win <- ifelse(champs2019$win == "W", 1, 0) %>%
  as.factor()

champs2019 <- dplyr::inner_join(playerRoles, champs2019, by = "player")

Hardpoint CHAMPS subset

Below is a subset of the Hardpoint data used for my Champs analysis. I selected directly for the grand finals match, as that would be the match that I would analyze.

# CHAMPS 2019 hardpoint data
hpChamps <- sqldf('SELECT team, end_time, match_id, player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM champs2019 WHERE series_id = "champs-bracket-grand-finals-0" AND mode = "Hardpoint"')
hpChamps <- hpChamps[order(hpChamps$team),]

Search and Destroy CHAMPS subset

Below is a subset of the Search and Destroy data used for my Champs analysis; the grand finals match is what I will ultimately analyze.

# CHAMPS 2019 SND data
sndChamps <- sqldf('SELECT team, end_time, match_id, player, win, role, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, x2_piece, x3_piece, x4_piece, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round FROM champs2019 WHERE mode == "Search & Destroy"')

# adds new column with fb/round ratio
sndChamps <- add_column(sndChamps, fb_round_ratio = sndChamps$snd_firstbloods/sndChamps$snd_rounds)

# adding a new column with average first bloods for the season
sndChamps <- sndChamps %>%
  group_by(player) %>%
  mutate(fb_avg = mean(snd_firstbloods))

# puts data in alphabetical order
sndChamps <- sndChamps[order(sndChamps$team),]

Control CHAMPS subset

Below is a subset of the Control data used for my Champs analysis; the grand finals match is what I will ultimately analyze.

# CHAMPS 2019 CONTROL data
controlChamps <- sqldf('SELECT team, end_time, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures, x2_piece, x3_piece, x4_piece, player, role, win, k_d, assists, damage_dealt, player_spm FROM champs2019 WHERE mode == "Control"')
controlChamps <- controlChamps[order(controlChamps$team),]

Exploratory Data Analysis

For my exploratory data analysis, I will use only the season data, not the Champs data. The EDA is done on non-split data, since I was personally interested in seeing the top performers for the WHOLE season. Splitting the data, while it would help remove bias on my end, would skew the EDA and ultimately cause confusion later.

Kill/death for season

From what I know about Call of Duty, K/D is always the most important, or at least the most talked about, statistic for a player. It is used as an indicator that a player is doing really well for his team. However, it must also be taken with a grain of salt, as some players are known to pad their stats by “baiting” their teammates. The following graphs look directly at the varying K/D of players over the whole season.
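For readers new to the stat, K/D is simply kills divided by deaths; a quick self-contained example with made-up numbers:

```r
# K/D ratio: total kills divided by total deaths (hypothetical numbers
# for five games of one player).
kills  <- c(18, 25, 12, 30, 22)
deaths <- c(20, 19, 15, 24, 22)
round(sum(kills) / sum(deaths), 2)  # 1.07
```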

Overall

ggplot(majors2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", title = "OVERALL Player K/D's, 2019 Season (BO4), Descending")

As we can see from the above graph, the players with the top overall K/D’s for the season were:

  1. Simp
  2. Brack
  3. Dashy
  4. Skyz
  5. Denz

This is interesting since this season was Simp’s breakout year. Since then, Simp has grown to be regarded as one of the best Call of Duty players of all time, having won 3 rings in 4 years, something that even the greats haven’t done.

We can also see that the players with the lowest overall K/D’s for the season were:

  1. MethodZsick
  2. Goonjar
  3. Knight
  4. Felony
  5. Diabolic

Hardpoint

ggplot(hp2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", title = "Player K/D's for HARDPOINT, 2019 Season (BO4), Descending")

From this plot, we can see that the top K/D’s in Hardpoint were:

  1. Octane
  2. Dashy
  3. Skrapz
  4. Skyz
  5. Temp

The lowest K/D’s in Hardpoint were:

  1. Aches
  2. MethodZsick
  3. Knight
  4. Felony
  5. Diabolic

Four of these bottom performers were also a part of the group with the lowest overall K/D’s. Hardpoint may be the reason that they all struggled.

Search and Destroy

ggplot(snd2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 5)) + labs(y = "Kill/death ratio", x = "Player", title = "Player K/D's for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

From this plot, we can see that the top K/D’s for Search and Destroy were:

  1. Brack
  2. Natshay
  3. Simp
  4. Denz
  5. Havok

The lowest K/D’s in Search and Destroy were:

  1. Asim
  2. Silly
  3. Knight
  4. Goonjar
  5. Diabolic

Once again, we can see that Knight, Goonjar, and Diabolic have not been performing well.

Control

ggplot(control2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", title = "Player K/D's for CONTROL, 2019 Season (BO4), Descending")

From this plot, we can see that the top K/D’s for Control were:

  1. Envoy
  2. Simp
  3. Dashy
  4. Maux
  5. Slasher

Throughout all of these top K/D’s, we have repeatedly seen Simp and Dashy at the top, which reflects in their overall K/D’s.

The lowest K/D’s in Control were:

  1. Jetli
  2. Zeeked
  3. Joee
  4. Felony
  5. Rizk

I only recently started watching Call of Duty again in the past few years, which means I didn’t watch the season that I am studying. I find it interesting that I have never heard of any of these bottom players except Felony; they couldn’t put up good numbers, and thus they haven’t made it back into the league.

Search and Destroy First Bloods

Search and Destroy is a game mode with multiple rounds, where in each round every player has only one life. A “first blood” is the first kill of the round and is usually highly influential. This is a common stat that commentators and the community look at.

Firstblood average

The firstblood average metric is simply the average number of firstbloods a player gets per game.
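The computation behind this metric is a grouped mean, mirroring the mutate(fb_avg = mean(snd_firstbloods)) step used when building snd2019; a self-contained toy version with made-up numbers:

```r
library(dplyr)

# Toy data (hypothetical): three games for player A, two for player B.
toy <- data.frame(player = c("A", "A", "A", "B", "B"),
                  snd_firstbloods = c(2, 1, 3, 0, 3))

toy %>%
  group_by(player) %>%
  summarise(fb_avg = mean(snd_firstbloods))  # A: 2, B: 1.5
```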

# player firstblood average for SND 2019

ggplot(snd2019, aes(x = reorder(player, fb_avg), y = fb_avg)) + geom_point() + coord_flip(ylim = c(0, 3)) + labs(y = "Firstblood Average", x = "Player", title = "Player Firstblood Average for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

As we can see from the plot, Huke led the pack with about 1.7 firstbloods per game. He was followed closely by Envoy, Dylan, Lucky, and Attach with about 1.4 firstbloods per game each. This is interesting since most of these players are SMG (submachine gun) players; their role is to find routes and open up the round by (hopefully) getting a first blood. The community would likely hold each of these players in high regard, and considering that I have heard of everyone except Lucky, these players are still strong performers. Next, I will look at total firstbloods per player.

Firstblood totals

# player firstbloods for SND 2019

ggplot(snd2019, aes(x = reorder(player, snd_firstbloods), y = snd_firstbloods)) + geom_boxplot() + coord_flip(ylim = c(0, 6)) + labs(y = "Firstbloods", x = "Player", title = "Player Firstbloods for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

The reason I plotted this was mainly to look at the outliers in the dataset. I was interested in seeing which players had the most firstbloods in a single game. From this plot, we can see that Dylan, Lucky, and Slasher all had 6 firstbloods in a single game. I would predict that they all won their respective games. Next, I will look at each player’s firstblood/round ratio.

Firstblood/round

I created this metric as a more realistic way of comparing players’ firstbloods. Obviously, if a game goes on for 11 rounds, a player is far more likely to rack up firstbloods than in a game that only went 6 rounds. Because of this, I wanted to see the ratio of firstbloods per round.
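A quick worked example of why the ratio normalizes for game length (hypothetical numbers):

```r
# The same 3 firstbloods count for less in an 11-round game than in a
# 6-round one once divided by rounds played.
firstbloods <- c(3, 3)
rounds      <- c(6, 11)
round(firstbloods / rounds, 2)  # 0.50 vs 0.27
```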

# player firstblood/round for SND 2019

ggplot(snd2019, aes(x = reorder(player, fb_round_ratio), y = fb_round_ratio)) + geom_boxplot() + coord_flip(ylim = c(0, 0.6)) + labs(y = "Firstblood/round ratio", x = "Player", title = "Player Firstblood/Round for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

I didn’t know what qualified as a “good” number for this metric, so it was important that I put this in descending order.

The players with the top Firstblood/round ratio were:

  1. Huke
  2. Envoy
  3. Dylan
  4. Lucky
  5. Blazt

Huke, Envoy, Dylan, and Lucky seem to be common factors when it comes to getting firstbloods in Search and Destroy. Next, I want to look at how each player compares with their overall damage dealt.

Overall Damage Dealt

# player damage dealt OVERALL 2019
# created a new dataset for plotting damage_dealt
playerDamage <- sqldf('SELECT player, damage_dealt FROM majors2019 WHERE damage_dealt != "0"')

ggplot(playerDamage, aes(x = reorder(player, damage_dealt), y = damage_dealt)) + geom_boxplot() + coord_flip(ylim = c(0, 10000)) + labs(y = "Damage Dealt", x = "Player", title = "OVERALL Player Damage Dealt, 2019 Season (BO4), Descending")

From the above plot, we can see that the top performers damage wise were:

  1. Temp
  2. Slasher
  3. Octane
  4. Dashy
  5. Envoy

The lowest damage was put out by:

  1. Profeezy
  2. Natshay
  3. Knight
  4. Diabolic
  5. Zeeked

Once again, there are common names among the highest performers and the lowest performers. It is also worth noting that every player in the top 5 is still competing to this day, whereas I have never heard of any of the bottom 5. Next, I will look at each player’s overall score per minute.

Overall Score/Min (SPM)

This is a metric that has always been a part of Call of Duty, but as an average player, I never really look at it. I want to see how this statistic applies to professionals.

# Overall score per minute for 2019 season

ggplot(majors2019, aes(x = reorder(player, player_spm), y = player_spm)) + geom_boxplot() + coord_flip(ylim = c(0, 675)) + labs(y = "Score per minute", x = "Player", title = "OVERALL Player Score per minute, 2019 Season (BO4), Descending")

The players with the highest SPM were:

  1. Dylan
  2. Envoy
  3. Octane
  4. Dashy
  5. Huke

The players with the lowest SPM were:

  1. Knight
  2. Bance
  3. Wailers
  4. JKap
  5. MethodZsick

What surprised me about this plot was seeing Bance and JKap in the bottom 5. Bance is still considered a top player to this day, and JKap has been a Call of Duty figurehead for years. On the other hand, the common names at the top were, once again, at the top. Next, I will look at each player’s number of wins throughout the year.

Number of Wins

# Overall number of wins for 2019 season

playerwins <- sqldf('SELECT player, win FROM majors2019 WHERE win == "1"') # selects all the wins for each player
playerwins <- playerwins %>% count(player) # counts the number of wins per player

ggplot(playerwins, aes(x = reorder(player, n), y = n)) + geom_bar(stat = 'identity', fill = "lightblue", color = "white") + coord_flip() + labs(y = "Number of Wins", x = "Player", title = "OVERALL Number of Wins per Player, 2019 Season (BO4), Descending")

The top 4 players with the most wins in the season are Slasher, Octane, Kenny, and Enable. The interesting part is that all of these players were on the same team, 100 Thieves. They all tied with 116 wins during the season. They are followed closely by Priestahh, who was their fifth teammate; he likely had to sit out a few matches, or was picked up shortly after the start of the season. Next, I will look at the distribution of wins.

Win Distribution

playerwins %>%
  ggplot(aes(x = n)) + geom_histogram(binwidth = 15, color = "black", fill = "lightblue")

The number of wins appears to follow a normal distribution, with the median being about 60 wins. The top players had around 100-116 wins.

Wins vs. Statistics

In order to plot the number of wins versus each stat, I will be making individual datasets as I go along. I feel that this section will be useful in determining possible strong indicators for a win or a loss. First, I will begin by plotting the number of wins vs. overall K/D.

Number of Wins vs. Overall K/D

playerkds <- sqldf('SELECT player, k_d FROM majors2019')
playerkds <- aggregate(playerkds$k_d, list(playerkds$player), FUN=mean)
kd_wins <- cbind(playerkds, playerwins)


ggplot(kd_wins, aes(x = x, y = n)) + 
  geom_point() + 
  geom_smooth(method = lm) + 
  labs(title = "Wins vs. Overall K/D", x = "Overall K/D", y = "Number of Wins")

The above plot shows that there is a positive correlation between a player’s overall K/D and the number of wins that they have. This might indicate that K/D is a good predictor for a win.
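One way to quantify the trend in the plot would be a Pearson correlation. A self-contained sketch with made-up per-player numbers (the real check would use the x and n columns of kd_wins):

```r
# Hypothetical per-player values: mean K/D and season win counts.
kd   <- c(0.80, 1.20, 0.90, 1.30, 1.00)
wins <- c(40, 90, 60, 95, 70)
cor(kd, wins)  # strongly positive, close to 1
```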

Number of Wins vs. Firstbloods

playerfb <- sqldf('SELECT player, snd_firstbloods FROM majors2019')
playerfb <- aggregate(playerfb$snd_firstbloods, list(playerfb$player), FUN=mean)
fb_wins <- cbind(playerfb, playerwins)


ggplot(fb_wins, aes(x = x, y = n)) + 
  geom_point() + 
  geom_smooth(method = lm) + 
  labs(title = "Wins vs. Firstblood Average", x = "Average Firstbloods per Game", y = "Number of Wins")

Again, there appears to be a positive correlation between the two variables, but it is definitely weaker than with K/D. Nevertheless, firstbloods should make for a decent indicator of a win or loss.

Number of Wins vs. Damage

playerdamage <- sqldf('SELECT player, damage_dealt FROM majors2019')
playerdamage <- aggregate(playerdamage$damage_dealt, list(playerdamage$player), FUN=mean)
damage_wins <- cbind(playerdamage, playerwins)


ggplot(damage_wins, aes(x = x, y = n)) + 
  geom_point() + 
  geom_smooth(method = lm) + 
  labs(title = "Wins vs. Damage Dealt", x = "Average Damage per Game", y = "Number of Wins")

There also appears to be a positive correlation between the amount of damage dealt per game and the number of wins a player has. Damage will likely make for a good predictor of a win or a loss.

Number of Wins vs. Score per Minute

playerspm <- sqldf('SELECT player, player_spm FROM majors2019')
playerspm <- aggregate(playerspm$player_spm, list(playerspm$player), FUN=mean)
spm_wins <- cbind(playerspm, playerwins)


ggplot(spm_wins, aes(x = x, y = n)) + 
  geom_point() + 
  geom_smooth(method = lm) + 
  labs(title = "Wins vs. Score per Minute", x = "Average SPM per Game", y = "Number of Wins")

Score per minute also has a positive correlation with the number of wins. Like all the others, this appears to be a decent predictor moving forward.

EDA Wrap

While there are definitely more metrics to investigate, I believe that my data analysis has provided me with a solid understanding of the dataset and an idea of what to expect in the later parts of my research. Next, I will move on to the exciting part: model building.


Model Building

Because there are multiple different game modes that I am trying to predict, I must fit models to each one. This is the reason that there are so many models below. They are essentially the same thing, but their recipes are different to accommodate new predictor variables and remove unnecessary ones. I will begin with Hardpoint, then move to Search and Destroy, and then wrap up with Control. To reiterate, my goal is to try and predict whether an individual player will win or lose a game based on their statistics in the given game.


Models: Hardpoint

Splitting Data:

Here, I am splitting the Hardpoint data with 80% training and 20% testing. The data is stratified on the “win” variable.

set.seed(3068)

hp2019_split <- hp2019 %>%
  initial_split(prop = 0.8, strata = "win")

hp2019_train <- training(hp2019_split)
hp2019_test <- testing(hp2019_split)

The following few lines of code just verify that my data split correctly and that everything looks good to proceed.

Head of dataset:

head(hp2019_train)
##    player  k_d role win assists damage_dealt player_spm hill_time_s
## 4   Abezy 0.66    1   0       6         3891      290.9          48
## 5   Abezy 1.18    1   0       8         4480      393.3          55
## 12  Abezy 0.88    1   0      14         4515      322.3          80
## 15  Abezy 0.76    1   0       8         4868      295.7          77
## 16  Abezy 0.83    1   0       4         3954      269.5          35
## 20  Abezy 1.33    1   0       7         4733      400.3          71
##    hill_captures hill_defends x2_piece x3_piece x4_piece
## 4              4            5        2        0        0
## 5              5           14        2        1        0
## 12             9            6        4        0        0
## 15             7           11        5        0        0
## 16             3            6        2        2        0
## 20             6           10        3        0        0

Training dataset dimensions:

dim(hp2019_train)
## [1] 3551   13

Testing dataset dimensions:

dim(hp2019_test)
## [1] 889  13

Distribution for the number of wins in the training set:

prop.table(table(hp2019_train$win))
## 
##         0         1 
## 0.4987328 0.5012672

Creating a recipe and folding:

To begin my Hardpoint models, I made a recipe containing all of my predictor variables, then normalized them.

hp_recipe <- recipe(win ~ k_d + assists + damage_dealt + 
                        player_spm + hill_time_s + hill_captures + 
                        hill_defends + x2_piece + x3_piece + x4_piece, 
                      data = hp2019_train) %>%
  step_normalize(all_predictors())

After making my recipe, I decided to fold my data with 10 folds and 5 repeats.

hp_train_folds <- vfold_cv(hp2019_train, v = 10, repeats = 5)

Model 1: Decision Tree

In order to create the decision tree, I used the following steps:

Creating a general decision tree specification using rpart:

hp_tree_spec <- decision_tree() %>%
  set_engine("rpart")

Setting a classification decision tree engine:

hp_class_tree_spec <- hp_tree_spec %>%
  set_mode("classification")

Fitting the model:

hp_class_tree_fit <- hp_class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + 
                        player_spm + hill_time_s + hill_captures + 
                        hill_defends + x2_piece + x3_piece + x4_piece, 
                      data = hp2019_train)

Visualizing the decision tree:

hp_class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)

Checking confusion matrix and accuracy of the train data:

augment(hp_class_tree_fit, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1379  601
##          1  392 1179
hp_dt_accuracy <- augment(hp_class_tree_fit, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
hp_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.720

Creating a workflow that is ready to tune cost complexity:

hp_class_tree_wf <- workflow() %>%
  add_model(hp_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(hp_recipe)

Setting up a regular grid:

param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

Fitting and tuning our model:

hp_dt_tune <- hp_class_tree_wf %>%
  tune_grid(
    resamples = hp_train_folds,
    grid = param_grid)

Plotting our model, which shows what cost-complexity produces the highest accuracy:

autoplot(hp_dt_tune)

Selecting the best performing value and finalizing the workflow:

hp_best_complexity <- select_best(hp_dt_tune, metric = "accuracy")

hp_class_tree_final <- finalize_workflow(hp_class_tree_wf, hp_best_complexity)

hp_class_tree_final_fit <- fit(hp_class_tree_final, data = hp2019_train)

Visualizing the final model:

hp_class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Checking the accuracy of the final model:

hp_tuned_dt_accuracy <- augment(hp_class_tree_final_fit, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
hp_tuned_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.756

As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.756, compared to that of the untuned model with an estimate of 0.720.


Model 2: Random Forest

Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Hardpoint recipe and my Hardpoint random forest model.

hp_rf_model <- rand_forest(min_n = tune(),
                        mtry = tune(),
                        mode = "classification") %>%
  set_engine("ranger")

hp_rf_workflow <- workflow() %>%
  add_model(hp_rf_model) %>%
  add_recipe(hp_recipe)

Next, I set up the parameters for the tuning grid. The parameters were extracted from the Hardpoint random forest model, and the mtry range was set from 2 to 10, the upper limit being the total number of predictors in the recipe.

hp_rf_parameters <- hardhat::extract_parameter_set_dials(hp_rf_model) %>%
  update(mtry = mtry(range = c(2, 10)))

hp_rf_grid <- grid_regular(hp_rf_parameters, levels = 2)

Then, I ran my model by tuning and fitting, using my folded data and my grid.

hp_rf_tune <- hp_rf_workflow %>%
  tune_grid(resamples = hp_train_folds,
            grid = hp_rf_grid)

The last thing to do was to plot my tuned model.

autoplot(hp_rf_tune)

As we can see from the above plot, it appears that accuracy tends to decrease as we add more predictor variables. I hypothesize that this is because k_d is by far the most significant predictor; the other predictors are much less informative and may actually worsen the model through overfitting. However, the decrease in accuracy is small overall.
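One way to check this hypothesis is to refit the best random forest with impurity-based importance and plot the importance scores. This is a sketch, assuming the objects above exist and the vip package is installed:

```r
# Sketch: refit the best random forest with impurity importance to see
# whether k_d dominates the other predictors. Assumes hp_rf_tune,
# hp_rf_workflow, and hp2019_train exist, and that vip is installed.
library(tidymodels)
library(vip)

hp_rf_best <- select_best(hp_rf_tune, metric = "accuracy")

hp_rf_imp_fit <- hp_rf_workflow %>%
  update_model(rand_forest(mtry = hp_rf_best$mtry,
                           min_n = hp_rf_best$min_n,
                           mode = "classification") %>%
                 set_engine("ranger", importance = "impurity")) %>%
  fit(data = hp2019_train)

# Plot variable importance from the underlying ranger fit.
hp_rf_imp_fit %>%
  extract_fit_engine() %>%
  vip()
```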

Checking the accuracy of the final model:

hp_rf_tuned_accuracy <- show_best(hp_rf_tune, metric = "accuracy")
hp_rf_tuned_accuracy
## # A tibble: 4 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1     2     2 accuracy binary     0.744    50 0.00304 Preprocessor1_Model1
## 2     2    40 accuracy binary     0.744    50 0.00280 Preprocessor1_Model3
## 3    10    40 accuracy binary     0.739    50 0.00300 Preprocessor1_Model4
## 4    10     2 accuracy binary     0.734    50 0.00298 Preprocessor1_Model2
hp_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.744

We had the highest accuracy of 0.744 with a minimum node size of 2 and an mtry of 2.


Model 3: Logistic Regression

First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”

hp_log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Setting up the workflow with the model created in the last step, as well as the recipe that I created earlier.

hp_log_wkflow <- workflow() %>% 
  add_model(hp_log_reg) %>% 
  add_recipe(hp_recipe)

Fit the model to the folded data:

hp_log_fit <- fit_resamples(hp_log_wkflow, hp_train_folds)

Collecting metrics based on the folded data:

collect_metrics(hp_log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.748    50 0.00339 Preprocessor1_Model1
## 2 roc_auc  binary     0.834    50 0.00296 Preprocessor1_Model1

Fitting the model to the whole dataset, not just the folds:

hp_log_fit_train <- fit(hp_log_wkflow, hp2019_train)

Assessing model performance with the training data:

predict(hp_log_fit_train, new_data = hp2019_train, type = "prob")
## # A tibble: 3,551 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1   0.845  0.155 
##  2   0.745  0.255 
##  3   0.242  0.758 
##  4   0.860  0.140 
##  5   0.914  0.0865
##  6   0.342  0.658 
##  7   0.816  0.184 
##  8   0.668  0.332 
##  9   0.707  0.293 
## 10   0.638  0.362 
## # … with 3,541 more rows

Confusion matrix:

augment(hp_log_fit_train, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1380  499
##          1  391 1281

Another visualization of the confusion matrix:

augment(hp_log_fit_train, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy with the training data:

hp_log_reg_accuracy <- augment(hp_log_fit_train, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
hp_log_reg_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.749

The accuracy with the training data was about 0.749, which is roughly the same as the accuracy with the folded data.
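Since collect_metrics() also reported an ROC AUC of 0.834 on the folds, the corresponding ROC curve on the training data can be visualized as well. A sketch, assuming hp_log_fit_train and hp2019_train exist (yardstick treats "0", the first factor level of win, as the event by default):

```r
# Sketch: ROC curve for the logistic model on the training data.
# Assumes hp_log_fit_train and hp2019_train exist.
library(tidymodels)

augment(hp_log_fit_train, new_data = hp2019_train) %>%
  roc_curve(truth = win, .pred_0) %>%
  autoplot()
```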


Model 4: K Nearest Neighbors

Just like all of the previous models, I will follow a similar pathway. The steps are as follows:

Setting up the model. I will be tuning “neighbors.”

hp_knn_model <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "classification") %>% 
  set_engine("kknn")

Next, I set up the workflow.

hp_knn_workflow <- workflow() %>% 
  add_model(hp_knn_model) %>% 
  add_recipe(hp_recipe)

I then set up the tuning grid.

hp_knn_parameters <- hardhat::extract_parameter_set_dials(hp_knn_model)
hp_knn_grid <- grid_regular(hp_knn_parameters, levels = 2)

Fitting and tuning my model:

hp_knn_tune <- hp_knn_workflow %>%
  tune_grid(resamples = hp_train_folds,
            grid = hp_knn_grid)

Plotting the model:

autoplot(hp_knn_tune, metric = "accuracy")

This plot shows that accuracy increases as the number of neighbors increases from 1 to 15.

Testing the accuracy of the model:

hp_knn_tuned_accuracy <- show_best(hp_knn_tune, metric = "accuracy")
hp_knn_tuned_accuracy
## # A tibble: 2 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 accuracy binary     0.709    50 0.00334 Preprocessor1_Model2
## 2         1 accuracy binary     0.649    50 0.00329 Preprocessor1_Model1

As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.709.
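Because grid_regular() with levels = 2 only tries 1 and 15 neighbors, accuracy may keep improving beyond 15. A sketch of a denser search over a wider range, assuming the workflow and folds above:

```r
# Sketch: tune over a wider, denser range of neighbors to check whether
# accuracy keeps rising past 15. Assumes hp_knn_workflow and
# hp_train_folds exist.
library(tidymodels)

hp_knn_grid_fine <- grid_regular(neighbors(range = c(1, 50)), levels = 10)

hp_knn_tune_fine <- hp_knn_workflow %>%
  tune_grid(resamples = hp_train_folds,
            grid = hp_knn_grid_fine)

autoplot(hp_knn_tune_fine, metric = "accuracy")
```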


Comparing Model Performance

hp_accuracies <- c(hp_tuned_dt_accuracy$.estimate,
                    hp_rf_tuned_accuracy[1,5],
                    hp_log_reg_accuracy$.estimate, 
                    hp_knn_tuned_accuracy[1, 4])
hp_accuracies
## [[1]]
## [1] 0.7555618
## 
## $mean
## [1] 0.7438462
## 
## [[3]]
## [1] 0.7493664
## 
## $mean
## [1] 0.7091532

The first number is the decision tree, followed by the random forest, logistic regression, and k-nearest neighbors model.

Our most accurate model was the decision tree with an accuracy of 0.756, followed by logistic regression's 0.750, random forest's 0.744, and k-nearest neighbors' 0.710.
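Because c() on a mix of tibbles and numeric vectors returns a list, the comparison is easier to read as a tibble with model names. A sketch, assuming the four accuracy objects created above exist:

```r
# Sketch: collect the four Hardpoint accuracies into one sortable table.
# Assumes hp_tuned_dt_accuracy, hp_rf_tuned_accuracy,
# hp_log_reg_accuracy, and hp_knn_tuned_accuracy exist.
library(tibble)
library(dplyr)

hp_results <- tibble(
  model = c("Decision Tree", "Random Forest",
            "Logistic Regression", "K Nearest Neighbors"),
  accuracy = c(hp_tuned_dt_accuracy$.estimate,
               hp_rf_tuned_accuracy$mean[1],
               hp_log_reg_accuracy$.estimate,
               hp_knn_tuned_accuracy$mean[1])
) %>%
  arrange(desc(accuracy))

hp_results
```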


Models: Search and Destroy

Splitting Data:

Here, I am splitting the Search and Destroy data with 80% training and 20% testing. The data is stratified on the “win” variable.

set.seed(1)

snd2019_split <- snd2019 %>%
  initial_split(prop = 0.8, strata = "win")

snd2019_train <- training(snd2019_split)
snd2019_test <- testing(snd2019_split)

The following few lines of code just ensure that my data split correctly and that everything looks good to proceed.

Head of the training set:

head(snd2019_train)
## # A tibble: 6 × 25
## # Groups:   player [1]
##   match_id      team  player role  win   kills deaths   k_d assists damage_dealt
##   <chr>         <chr> <chr>  <fct> <fct> <dbl>  <dbl> <dbl>   <dbl>        <dbl>
## 1 737440468739… eUni… Abezy  1     0        11      7  1.57       0         1355
## 2 144010197940… eUni… Abezy  1     0         7      7  1          1         1938
## 3 259560665349… eUni… Abezy  1     0         2      7  0.29       4          796
## 4 146217688303… eUni… Abezy  1     0         7      7  1          0         1275
## 5 177954718572… eUni… Abezy  1     0         9      8  1.12       1         1561
## 6 167778343948… eUni… Abezy  1     0        10      9  1.11       3         1434
## # … with 15 more variables: player_spm <dbl>, bomb_sneak_defuses <dbl>,
## #   bomb_plants <dbl>, bomb_defuses <dbl>, snd_rounds <dbl>,
## #   snd_firstbloods <dbl>, snd_1_kill_round <dbl>, snd_2_kill_round <dbl>,
## #   snd_3_kill_round <dbl>, snd_4_kill_round <dbl>, x2_piece <dbl>,
## #   x3_piece <dbl>, x4_piece <dbl>, fb_round_ratio <dbl>, fb_avg <dbl>

Dimensions of the training set:

dim(snd2019_train)
## [1] 2791   25

Dimensions of the testing set:

dim(snd2019_test)
## [1] 699  25

Distribution of wins for the training set:

prop.table(table(snd2019_train$win))
## 
##        0        1 
## 0.498746 0.501254

Creating a recipe and folding:

To begin with my Search and Destroy models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.

snd_recipe <- recipe(win ~ k_d + assists + damage_dealt + 
                        player_spm + bomb_sneak_defuses + 
                        bomb_plants + bomb_defuses + snd_firstbloods + 
                       fb_round_ratio + snd_1_kill_round +
                       snd_2_kill_round + snd_3_kill_round +
                       snd_4_kill_round + x2_piece + x3_piece + x4_piece, 
                      data = snd2019_train) %>%
  step_normalize(all_predictors())

After making my recipe, I decided to fold my data with 10 folds and 5 repeats.

snd_train_folds <- vfold_cv(snd2019_train, v = 10, repeats = 5)
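With v = 10 and repeats = 5, the resampling object holds 10 × 5 = 50 assessment folds, which is why the metric tables in the following models report n = 50. A quick check, assuming snd_train_folds exists:

```r
# Sketch: each of the 5 repeats contributes 10 folds, so the resamples
# tibble has 50 rows. Assumes snd_train_folds exists.
nrow(snd_train_folds)
```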

Model 1: Decision Tree

Creating a general decision tree specification using rpart:

snd_tree_spec <- decision_tree() %>%
  set_engine("rpart")

Setting a classification decision tree engine:

snd_class_tree_spec <- snd_tree_spec %>%
  set_mode("classification")

Fitting the model:

snd_class_tree_fit <- snd_class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + 
                        player_spm + bomb_sneak_defuses + 
                        bomb_plants + bomb_defuses + snd_firstbloods + 
                       fb_round_ratio + snd_1_kill_round +
                       snd_2_kill_round + snd_3_kill_round +
                       snd_4_kill_round + x2_piece + x3_piece + x4_piece, 
                      data = snd2019_train)

Visualizing the decision tree:

snd_class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)

Checking confusion matrix and accuracy of the train data:

augment(snd_class_tree_fit, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1099  544
##          1  293  855
snd_dt_accuracy <- augment(snd_class_tree_fit, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
snd_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.700

Creating a workflow that is ready to tune cost complexity:

snd_class_tree_wf <- workflow() %>%
  add_model(snd_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(snd_recipe)

Setting up a regular grid:

parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

Fitting and tuning our model:

snd_rf_tune <- snd_class_tree_wf %>%
  tune_grid(resamples = snd_train_folds, grid = parameter_grid)

Plotting our model, which shows which cost-complexity value produces the highest accuracy:

autoplot(snd_rf_tune)

Selecting the best performing value and finalizing the workflow:

snd_best_complexity <- select_best(snd_rf_tune, metric = "accuracy")

snd_class_tree_final <- finalize_workflow(snd_class_tree_wf, snd_best_complexity)

snd_class_tree_final_fit <- fit(snd_class_tree_final, data = snd2019_train)

Visualizing the final model:

snd_class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Checking the accuracy of the final model:

snd_tuned_dt_accuracy <- augment(snd_class_tree_final_fit, new_data = snd2019_train) %>% accuracy(truth = win, estimate = .pred_class)
snd_tuned_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.742

As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.742, compared to that of the untuned model with an estimate of 0.700.


Model 2: Random Forest

Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my SND recipe and my SND random forest model.

snd_rf_model <- rand_forest(min_n = tune(),
                        mtry = tune(),
                        mode = "classification") %>%
  set_engine("ranger")

snd_rf_workflow <- workflow() %>%
  add_model(snd_rf_model) %>%
  add_recipe(snd_recipe)

Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the SND random forest model, and the mtry range was set from 1 to 10. This mtry limit was set slightly below the maximum number of predictors.

snd_rf_parameters <- hardhat::extract_parameter_set_dials(snd_rf_model) %>%
  update(mtry = mtry(range = c(1, 10)))

snd_rf_grid <- grid_regular(snd_rf_parameters, levels = 2)

Then, I ran my model by tuning and fitting, using my folded data and my grid.

snd_rf_tune <- snd_rf_workflow %>%
  tune_grid(resamples = snd_train_folds,
            grid = snd_rf_grid)

The last thing to do was to plot my tuned model.

autoplot(snd_rf_tune)

As we can see from the above plot, when we increase the number of predictors, our accuracy and ROC AUC slightly increase. I hypothesize that this differs from the Hardpoint plot because more of the predictors are significant in Search and Destroy.

Checking the accuracy of the final model:

snd_rf_tuned_accuracy <- show_best(snd_rf_tune, metric = "accuracy")
snd_rf_tuned_accuracy
## # A tibble: 4 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1    10    40 accuracy binary     0.709    50 0.00373 Preprocessor1_Model4
## 2    10     2 accuracy binary     0.703    50 0.00376 Preprocessor1_Model2
## 3     1     2 accuracy binary     0.668    50 0.00463 Preprocessor1_Model1
## 4     1    40 accuracy binary     0.666    50 0.00500 Preprocessor1_Model3
snd_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.709

We had the highest accuracy of 0.709 with a minimum node size of 40 and an mtry of 10.


Model 3: Logistic Regression

First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”

snd_log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Setting up the workflow with the model I created in the last step, as well as the recipe that I created earlier.

snd_log_wkflow <- workflow() %>% 
  add_model(snd_log_reg) %>% 
  add_recipe(snd_recipe)

Fit the model to the folded data:

snd_log_fit <- fit_resamples(snd_log_wkflow, snd_train_folds)

Collecting metrics based on the folded data:

collect_metrics(snd_log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.732    50 0.00328 Preprocessor1_Model1
## 2 roc_auc  binary     0.807    50 0.00379 Preprocessor1_Model1

Fitting the model to the whole dataset, not just the folds:

snd_log_fit_train <- fit(snd_log_wkflow, snd2019_train)

Assessing model performance with the training data:

predict(snd_log_fit_train, new_data = snd2019_train, type = "prob")
## # A tibble: 2,791 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1   0.571   0.429
##  2   0.739   0.261
##  3   0.712   0.288
##  4   0.654   0.346
##  5   0.771   0.229
##  6   0.825   0.175
##  7   0.486   0.514
##  8   0.748   0.252
##  9   0.433   0.567
## 10   0.458   0.542
## # … with 2,781 more rows

Confusion matrix:

augment(snd_log_fit_train, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1134  474
##          1  258  925

Another visualization of the confusion matrix:

augment(snd_log_fit_train, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy with the training data:

snd_log_reg_accuracy <- augment(snd_log_fit_train, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
snd_log_reg_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.738

The final accuracy that we got was 0.738, which was slightly higher than the accuracy of 0.732 on the folded data.


Model 4: K Nearest Neighbors

Just like all of the previous models, I will follow a similar pathway. The steps are as follows:

Setting up the model. I will be tuning “neighbors.”

snd_knn_model <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "classification") %>% 
  set_engine("kknn")

Next, I set up the workflow.

snd_knn_workflow <- workflow() %>% 
  add_model(snd_knn_model) %>% 
  add_recipe(snd_recipe)

I then set up the tuning grid.

snd_knn_parameters <- hardhat::extract_parameter_set_dials(snd_knn_model)
snd_knn_grid <- grid_regular(snd_knn_parameters, levels = 2)

Fitting and tuning my model:

snd_knn_tune <- snd_knn_workflow %>%
  tune_grid(resamples = snd_train_folds,
            grid = snd_knn_grid)

Plotting the model:

autoplot(snd_knn_tune, metric = "accuracy")

This plot shows that accuracy increases as the number of neighbors increases from 1 to 15.

Testing the accuracy of the model:

snd_knn_tuned_accuracy <- show_best(snd_knn_tune, metric = "accuracy")
snd_knn_tuned_accuracy
## # A tibble: 2 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 accuracy binary     0.619    50 0.00403 Preprocessor1_Model2
## 2         1 accuracy binary     0.595    50 0.00385 Preprocessor1_Model1

As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.619.


Comparing Model Performance

snd_accuracies <- c(snd_tuned_dt_accuracy$.estimate,
                    snd_rf_tuned_accuracy[1,5],
                    snd_log_reg_accuracy$.estimate, 
                    snd_knn_tuned_accuracy[1, 4])
snd_accuracies
## [[1]]
## [1] 0.7416697
## 
## $mean
## [1] 0.7091295
## 
## [[3]]
## [1] 0.7377284
## 
## $mean
## [1] 0.6188415

The first number is the decision tree, followed by the random forest, logistic regression, and k-nearest neighbors model.

As we can see, the decision tree has the highest accuracy at 0.742. It is followed by logistic regression with 0.738, random forest with 0.709, and k-nearest neighbors with 0.619.


Models: Control

Splitting Data:

Here, I am splitting the Control data with 80% training and 20% testing. The data is stratified on the “win” variable.

set.seed(1)

control2019_split <- control2019 %>%
  initial_split(prop = 0.8, strata = "win")

control2019_train <- training(control2019_split)
control2019_test <- testing(control2019_split)

The following few lines of code just ensure that my data split correctly and that everything looks good to proceed.

Head of the training set:

head(control2019_train)
##    player role win  k_d assists damage_dealt player_spm x2_piece x3_piece
## 4   Abezy    1   0 1.30      17         6459      386.2        5        1
## 16  Abezy    1   0 0.88       4         5353      215.9        2        0
## 18  Abezy    1   0 1.00       3         2657      285.7        3        0
## 21  Abezy    1   0 0.70       4         3077      225.0        1        0
## 22  Abezy    1   0 0.75       5         3357      217.5        2        1
## 23  Abezy    1   0 0.78       5         4315      215.3        1        1
##    x4_piece ctrl_firstbloods ctrl_firstdeaths ctrl_captures
## 4         0                1                0             4
## 16        0                1                0             2
## 18        0                0                0             1
## 21        0                0                0             1
## 22        0                1                1             1
## 23        0                1                0             2

Dimensions of the training set:

dim(control2019_train)
## [1] 2122   13

Dimensions of the testing set:

dim(control2019_test)
## [1] 532  13

Distribution of wins for the training set:

prop.table(table(control2019_train$win))
## 
##         0         1 
## 0.4995287 0.5004713

Creating a recipe and folding:

To begin with my Control models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.

control_recipe <- recipe(win ~ k_d + assists + damage_dealt + 
                        player_spm + ctrl_firstbloods +
                          ctrl_firstdeaths + ctrl_captures + 
                          x2_piece + x3_piece + x4_piece, 
                      data = control2019_train) %>%
  step_normalize(all_predictors())

After making my recipe, I decided to fold my data with 10 folds and 5 repeats.

control_train_folds <- vfold_cv(control2019_train, v = 10, repeats = 5)

Model 1: Decision Tree

Creating a general decision tree specification using rpart:

control_tree_spec <- decision_tree() %>%
  set_engine("rpart")

Setting a classification decision tree engine:

control_class_tree_spec <- control_tree_spec %>%
  set_mode("classification")

Fitting the model:

control_class_tree_fit <- control_class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + 
                        player_spm + ctrl_firstbloods +
                          ctrl_firstdeaths + ctrl_captures + 
                          x2_piece + x3_piece + x4_piece, 
                      data = control2019_train)

Visualizing the decision tree:

control_class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

Checking confusion matrix and accuracy of the train data:

augment(control_class_tree_fit, new_data = control2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction   0   1
##          0 742 250
##          1 318 812
control_dt_accuracy <- augment(control_class_tree_fit, new_data = control2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
control_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.732

Creating a workflow that is ready to tune cost complexity:

control_class_tree_wf <- workflow() %>%
  add_model(control_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(control_recipe)

Setting up a regular grid:

parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

Fitting and tuning our model:

control_rf_tune <- control_class_tree_wf %>%
  tune_grid(resamples = control_train_folds, grid = parameter_grid)

Plotting our model, which shows which cost-complexity value produces the highest accuracy:

autoplot(control_rf_tune)

As we can see, the accuracy and ROC AUC tends to decrease as the cost-complexity parameter increases.

Selecting the best performing value and finalizing the workflow:

control_best_complexity <- select_best(control_rf_tune, metric = "accuracy")

control_class_tree_final <- finalize_workflow(control_class_tree_wf, control_best_complexity)

control_class_tree_final_fit <- fit(control_class_tree_final, data = control2019_train)

Visualizing the final model:

control_class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Checking the accuracy of the final model:

control_tuned_dt_accuracy <- augment(control_class_tree_final_fit, new_data = control2019_train) %>% accuracy(truth = win, estimate = .pred_class)
control_tuned_dt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.827

As we can see, the final accuracy for the tuned model was noticeably higher, with an estimate of 0.827 compared to the untuned model's estimate of 0.732.


Model 2: Random Forest

Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Control recipe and my Control random forest model.

control_rf_model <- rand_forest(min_n = tune(),
                        mtry = tune(),
                        mode = "classification") %>%
  set_engine("ranger")

control_rf_workflow <- workflow() %>%
  add_model(control_rf_model) %>%
  add_recipe(control_recipe)

Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the Control random forest model, and the mtry range was set from 1 to 9. This mtry limit was set slightly below the maximum number of predictors.

control_rf_parameters <- hardhat::extract_parameter_set_dials(control_rf_model) %>%
  update(mtry = mtry(range = c(1, 9)))

control_rf_grid <- grid_regular(control_rf_parameters, levels = 2)

Then, I ran my model by tuning and fitting, using my folded data and my grid.

control_rf_tune <- control_rf_workflow %>%
  tune_grid(resamples = control_train_folds,
            grid = control_rf_grid)

The last thing to do was to plot my tuned model.

autoplot(control_rf_tune)

As we can see from the above plot, when we increase the number of predictors, our accuracy and ROC AUC slightly increase. I hypothesize that this differs from the Hardpoint plot because more of the predictors are significant in Control.

Checking the accuracy of the final model:

control_rf_tuned_accuracy <- show_best(control_rf_tune, metric = "accuracy")
control_rf_tuned_accuracy
## # A tibble: 4 × 8
##    mtry min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1     9    40 accuracy binary     0.749    50 0.00453 Preprocessor1_Model4
## 2     9     2 accuracy binary     0.742    50 0.00443 Preprocessor1_Model2
## 3     1     2 accuracy binary     0.740    50 0.00420 Preprocessor1_Model1
## 4     1    40 accuracy binary     0.736    50 0.00438 Preprocessor1_Model3
control_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.749

We had the highest accuracy of 0.749 with a minimum node size of 40 and an mtry of 9.


Model 3: Logistic Regression

First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”

control_log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Setting up the workflow with the model I created in the last step, as well as the recipe that I created earlier.

control_log_wkflow <- workflow() %>% 
  add_model(control_log_reg) %>% 
  add_recipe(control_recipe)

Fit the model to the folded data:

control_log_fit <- fit_resamples(control_log_wkflow, control_train_folds)

Collecting metrics based on the folded data:

collect_metrics(control_log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.767    50 0.00504 Preprocessor1_Model1
## 2 roc_auc  binary     0.841    50 0.00467 Preprocessor1_Model1

Fitting the model to the whole dataset, not just the folds:

control_log_fit_train <- fit(control_log_wkflow, control2019_train)

Assessing model performance with the training data:

predict(control_log_fit_train, new_data = control2019_train, type = "prob")
## # A tibble: 2,122 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1  0.0608   0.939
##  2  0.855    0.145
##  3  0.717    0.283
##  4  0.875    0.125
##  5  0.830    0.170
##  6  0.810    0.190
##  7  0.620    0.380
##  8  0.511    0.489
##  9  0.534    0.466
## 10  0.339    0.661
## # … with 2,112 more rows

Confusion matrix:

augment(control_log_fit_train, new_data = control2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction   0   1
##          0 849 277
##          1 211 785

Another visualization of the confusion matrix:

augment(control_log_fit_train, new_data = control2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy with the training data:

control_log_reg_accuracy <- augment(control_log_fit_train, new_data = control2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
control_log_reg_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.770

The final model had an accuracy of 0.770, which was slightly higher than the folded data accuracy of 0.767.


Model 4: K Nearest Neighbors

Just like all of the previous models, I will follow a similar pathway. The steps are as follows:

Setting up the model. I will be tuning “neighbors.”

control_knn_model <- 
  nearest_neighbor(
    neighbors = tune(),
    mode = "classification") %>% 
  set_engine("kknn")

Next, I set up the workflow.

control_knn_workflow <- workflow() %>% 
  add_model(control_knn_model) %>% 
  add_recipe(control_recipe)

I then set up the tuning grid.

control_knn_parameters <- hardhat::extract_parameter_set_dials(control_knn_model)
control_knn_grid <- grid_regular(control_knn_parameters, levels = 2)

Fitting and tuning my model:

control_knn_tune <- control_knn_workflow %>%
  tune_grid(resamples = control_train_folds,
            grid = control_knn_grid)

Plotting the model:

autoplot(control_knn_tune, metric = "accuracy")

This plot shows that accuracy increases as the number of neighbors increases from 1 to 15.

Testing the accuracy of the model:

control_knn_tuned_accuracy <- show_best(control_knn_tune, metric = "accuracy")
control_knn_tuned_accuracy
## # A tibble: 2 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 accuracy binary     0.735    50 0.00426 Preprocessor1_Model2
## 2         1 accuracy binary     0.679    50 0.00472 Preprocessor1_Model1

As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.735.


Comparing Model Performance

control_accuracies <- c(control_tuned_dt_accuracy$.estimate,
                    control_rf_tuned_accuracy[1,5],
                    control_log_reg_accuracy$.estimate, 
                    control_knn_tuned_accuracy[1, 4])
control_accuracies
## [[1]]
## [1] 0.82705
## 
## $mean
## [1] 0.7494743
## 
## [[3]]
## [1] 0.7700283
## 
## $mean
## [1] 0.7350554

The first number is the decision tree, followed by the random forest, logistic regression, and k-nearest neighbors model.

As we can see, the decision tree has the highest accuracy of 0.827. Logistic regression follows with 0.770, random forest with 0.749, and k-nearest neighbors with 0.735.

I suspect that Control's models are more accurate than those of the other two modes because kills are crucial to how Control is played, so K/D likely carries even more weight in these models.


Model Finalization

According to my models, the decision tree appeared to be the best model for all three of our game modes. To move forward, we will now finalize our model for each game mode and fit our test data.
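As a side note, tidymodels also offers last_fit(), which fits a finalized workflow on the training portion of a split and evaluates on the test portion in one step. A sketch for Hardpoint, assuming hp_class_tree_final and the hp2019_split object created earlier in the project:

```r
# Sketch: fit the finalized Hardpoint workflow on the training data and
# evaluate it on the held-out test set in a single call. Assumes
# hp_class_tree_final and hp2019_split exist.
library(tidymodels)

hp_last_fit <- last_fit(hp_class_tree_final, split = hp2019_split)

# Test-set accuracy and ROC AUC.
collect_metrics(hp_last_fit)
```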

Hardpoint

I will begin fitting the final models with Hardpoint.

hp_metric <- metric_set(accuracy)

hp_model_test <- predict(hp_class_tree_final_fit, new_data = hp2019_test)  %>% 
  bind_cols(hp2019_test %>% select(win)) 
hp_model_test
## # A tibble: 889 × 2
##    .pred_class win  
##    <fct>       <fct>
##  1 0           1    
##  2 0           0    
##  3 1           0    
##  4 0           1    
##  5 1           1    
##  6 1           0    
##  7 1           1    
##  8 1           1    
##  9 1           1    
## 10 0           0    
## # … with 879 more rows
hp_model_test %>%
  hp_metric(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.693

The final accuracy for the Hardpoint test data was 0.693. This is slightly lower than the training accuracy of 0.756, but roughly comparable.

Search and Destroy

Next, I will fit the Search and Destroy test data.

snd_metric <- metric_set(accuracy)

snd_model_test <- predict(snd_class_tree_final_fit, new_data = snd2019_test)  %>% 
  bind_cols(snd2019_test %>% select(win)) 
snd_model_test
## # A tibble: 699 × 3
##    .pred_class player win  
##    <fct>       <chr>  <fct>
##  1 0           Abezy  0    
##  2 0           Abezy  0    
##  3 0           Abezy  0    
##  4 0           Abezy  1    
##  5 0           Abezy  0    
##  6 1           Abezy  0    
##  7 0           Abezy  0    
##  8 1           Abezy  1    
##  9 1           Abezy  1    
## 10 0           Abezy  1    
## # … with 689 more rows
snd_model_test %>%
  snd_metric(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.720

The final accuracy on the Search and Destroy test data was 0.720. This is a bit lower than what we got on our training data, but roughly comparable.

Control

Next, I will fit the Control test data.

control_metric <- metric_set(accuracy)

control_model_test <- predict(control_class_tree_final_fit, new_data = control2019_test)  %>% 
  bind_cols(control2019_test %>% select(win)) 
control_model_test
## # A tibble: 532 × 2
##    .pred_class win  
##    <fct>       <fct>
##  1 1           1    
##  2 0           0    
##  3 1           1    
##  4 1           1    
##  5 0           0    
##  6 0           0    
##  7 0           0    
##  8 1           0    
##  9 0           1    
## 10 0           0    
## # … with 522 more rows
control_model_test %>%
  control_metric(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.729

The final accuracy on the Control test data was 0.729. This is much less than the 0.827 we got on our training data.

Attempting to Predict the COD Champs Final

The COD Champs tournament is the largest tournament of the year. During Black Ops 4, the two teams that competed in the grand finals were eUnited and 100 Thieves. I will be using my models to try and predict the winner of this grand finals.

Map 1: Hardpoint

First, I will predict the first Hardpoint map. I created a dataframe which only contained the data from the first map.

The printed dataframes below show whether a player is predicted to win (1) or lose (0) and what team they are on.

map1_hp <- sqldf('SELECT * FROM hpChamps where end_time = "2019-08-18 21:19:01 UTC"')

map1_hp_prediction <- predict(hp_class_tree_final_fit, map1_hp) %>%
  cbind(map1_hp$player, map1_hp$team)
map1_hp_prediction
##    .pred_class map1_hp$player map1_hp$team
## 1            0         Enable  100 Thieves
## 2            0          Kenny  100 Thieves
## 3            1         Octane  100 Thieves
## 4            0      Priestahh  100 Thieves
## 5            1        Slasher  100 Thieves
## 6            1          Abezy      eUnited
## 7            1        Arcitys      eUnited
## 8            1       Clayster      eUnited
## 9            1      Prestinni      eUnited
## 10           1           Simp      eUnited

My model predicted that Enable, Kenny, and Priestahh would lose, whereas Octane and Slasher would win. On the other hand, every player on eUnited was predicted to win by my model.

Even though Octane’s and Slasher’s individual statistics were adequate for a win, given the predictions for the rest of their team and the unanimous predictions for eUnited, I would predict that eUnited won the first Hardpoint map.

In reality, eUnited did win the first map 250-126! So far, so good.

My prediction: 1-0 eUnited
Reality: 1-0 eUnited

Map 2: Search and Destroy

Next, I will look at Map 2, Search and Destroy.

map2_snd <- sqldf('SELECT * FROM sndChamps where end_time = "2019-08-18 21:37:20 UTC"')

map2_snd_prediction <- predict(snd_class_tree_final_fit, map2_snd) %>%
  cbind(map2_snd$player, map2_snd$team)
map2_snd_prediction
##    .pred_class map2_snd$player map2_snd$team
## 1            1          Enable   100 Thieves
## 2            0           Kenny   100 Thieves
## 3            0          Octane   100 Thieves
## 4            1       Priestahh   100 Thieves
## 5            0         Slasher   100 Thieves
## 6            1           Abezy       eUnited
## 7            0         Arcitys       eUnited
## 8            0        Clayster       eUnited
## 9            0       Prestinni       eUnited
## 10           1            Simp       eUnited

My model predicted that Enable and Priestahh would win from 100 Thieves, and that Abezy and Simp would win from eUnited. This exposes a flaw in my approach to determining an overall winner: with two predicted winners on each side, this should in theory be a tie. For the sake of extending the series, let’s just say that 100 Thieves wins. Thus, I would predict that 100 Thieves won the second map.
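To make the tie explicit, the per-player predictions can be tallied by team. A minimal, self-contained sketch, with the Map 2 prediction vector typed in by hand:

```r
library(dplyr)

# Map 2 predictions from above, entered manually for illustration
map2_votes <- tibble(
  team = rep(c("100 Thieves", "eUnited"), each = 5),
  pred_win = c(1, 0, 0, 1, 0,   1, 0, 0, 0, 1)
)

# Tally predicted winners per team: both teams end up with 2,
# so a simple majority vote cannot pick a winner here
map2_votes %>%
  group_by(team) %>%
  summarise(predicted_wins = sum(pred_win))
```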

In reality, 100 Thieves won the second map 6-3.

My prediction: 1-1 Tied
Reality: 1-1 Tied

Map 3: Control

Next, I will look at Map 3, Control.

map3_control <- sqldf('SELECT * FROM controlChamps where end_time = "2019-08-18 21:55:55 UTC"')

map3_control_prediction <- predict(control_class_tree_final_fit, map3_control) %>%
  cbind(map3_control$player, map3_control$team)
map3_control_prediction
##    .pred_class map3_control$player map3_control$team
## 1            0              Enable       100 Thieves
## 2            1               Kenny       100 Thieves
## 3            1              Octane       100 Thieves
## 4            1           Priestahh       100 Thieves
## 5            0             Slasher       100 Thieves
## 6            1               Abezy           eUnited
## 7            0             Arcitys           eUnited
## 8            0            Clayster           eUnited
## 9            0           Prestinni           eUnited
## 10           0                Simp           eUnited

My model predicted that Kenny, Octane, and Priestahh would win from 100 Thieves, and only Abezy would win from eUnited. Thus, I would predict that 100 Thieves won this map.

In reality, 100 Thieves did win the third map, 3-1! Another success, phew.

My prediction: 2-1 100 Thieves
Reality: 2-1 100 Thieves

Map 4: Hardpoint

Next, I will look at Map 4, Hardpoint.

map4_hp <- sqldf('SELECT * FROM hpChamps where end_time = "2019-08-18 22:11:21 UTC"')

map4_hp_prediction <- predict(hp_class_tree_final_fit, map4_hp) %>%
  cbind(map4_hp$player, map4_hp$team)
map4_hp_prediction
##    .pred_class map4_hp$player map4_hp$team
## 1            0         Enable  100 Thieves
## 2            1          Kenny  100 Thieves
## 3            1         Octane  100 Thieves
## 4            1      Priestahh  100 Thieves
## 5            0        Slasher  100 Thieves
## 6            1          Abezy      eUnited
## 7            1        Arcitys      eUnited
## 8            0       Clayster      eUnited
## 9            0      Prestinni      eUnited
## 10           1           Simp      eUnited

My model predicted that Kenny, Octane, and Priestahh would win from 100 Thieves, and that Abezy, Arcitys, and Simp would win from eUnited. With three predicted winners on each side, this is the same flaw in my methodology as before. For the sake of extending the series once again, I’ll predict that eUnited won Map 4.

In reality, eUnited won Map 4 250-219.

My prediction: 2-2 Tie
Reality: 2-2 Tie

Map 5: Search and Destroy

This is what it all comes down to! My model has gotten us to this final map (albeit through a different path than the real series took). Lastly, I will predict the winner of the final Search and Destroy to determine the winner of the whole event.

map5_snd <- sqldf('SELECT * FROM sndChamps where end_time = "2019-08-18 22:32:13 UTC"')

map5_snd_prediction <- predict(snd_class_tree_final_fit, map5_snd) %>%
  cbind(map5_snd$player, map5_snd$team)
map5_snd_prediction
##    .pred_class map5_snd$player map5_snd$team
## 1            0          Enable   100 Thieves
## 2            0           Kenny   100 Thieves
## 3            0          Octane   100 Thieves
## 4            0       Priestahh   100 Thieves
## 5            1         Slasher   100 Thieves
## 6            1           Abezy       eUnited
## 7            1         Arcitys       eUnited
## 8            1        Clayster       eUnited
## 9            1       Prestinni       eUnited
## 10           0            Simp       eUnited

My model predicted that Slasher on 100 Thieves would win, and that Abezy, Arcitys, Clayster, and Prestinni would win from eUnited. Thus, eUnited is my overall predicted winner!

In reality, eUnited did win the final map 6-4! My model was correct in the end!

If you would like to see the actual series between 100 Thieves and eUnited, I have linked it below.

embed_url("https://www.youtube.com/watch?v=AfWqhymZnoc")

Conclusion

Discussion

I expect that most of my models had lower accuracy on the test data than on the training data because of overfitting. I had a feeling this would happen, since the final decision trees were much more complex than the ones before tuning. Regardless, I am happy with an accuracy of roughly 0.70 for each model. For a game that is largely a team effort, being able to predict a win with 0.70 accuracy by looking at only one player’s statistics is not too bad.

Player Statistic Takeaways

Apart from all the research and code that is currently shown in this project, there were also hundreds of lines of code that didn’t make the cut. I tried multiple different methods of modeling, multiple different ways of organizing data, and multiple different metrics that weren’t included in the original data set. I tried to approach it from both an individual player’s perspective, as well as a team’s perspective. Ultimately, as the project discusses, I went with an individual player’s statistics. From this research, I have learned a lot about the predictive variables that determine a win or a loss for a player.

One thing that I immediately noticed was the massive importance of Kill/Death (K/D) ratio. In practically every model, K/D was the most important metric in determining a win or a loss. It overpowered most other metrics, which was a shock to me. For example, as discussed before, First Bloods in Search and Destroy are thought to be very important to winning. Statistically, however, they didn’t even appear in some of the models.
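If you would like to check this kind of importance ranking yourself, the `vip` package can plot variable importance from a fitted tree. A sketch, assuming the finalized workflow fit from earlier:

```r
library(tidymodels)
library(vip)

# Pull the underlying rpart model out of the fitted workflow
# and plot its variable importance scores; if the observation
# above holds, K/D should sit at the top
hp_class_tree_final_fit %>%
  extract_fit_engine() %>%
  vip()
```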

Another thing I noticed was how unimportant some other metrics were; several that I expected to matter simply didn’t. However, I would be interested to see how these metrics might be weighted differently if you approached the problem from a team perspective.

Future Research

In the future, I believe a more effective approach to predicting a win or a loss would be to look at a team’s statistics as a whole. You could take this even further by ranking each team using their past statistics and using those rankings to predict future matches, which would make for an interesting unsupervised learning experiment. Additionally, working with team data would clear up the issues I ran into when predicting the Champs Grand Finals: my player-by-player approach was flawed from the start, and I didn’t realize it until it came time to predict everything.
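As a first step toward that team-level approach, the player rows could be collapsed into one row per team per map. A sketch, where `hp2019` and its column names (`kills`, `deaths`, `win`, `end_time`) are assumptions based on the player-level data used above:

```r
library(dplyr)

# Collapse player-level Hardpoint rows into one row per team per map;
# hp2019 and its column names are assumed to match the player-level data
team_hp <- hp2019 %>%
  group_by(end_time, team) %>%
  summarise(
    team_kills  = sum(kills),
    team_deaths = sum(deaths),
    team_kd     = sum(kills) / sum(deaths),
    win         = first(win),
    .groups     = "drop"
  )
```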

Additionally, the data that I used for this research was from the CWL in 2018. Since then, the CWL has transformed into the CDL (Call of Duty League), with more focus placed on data. There are predictors available now that just weren’t in the CWL data. Future research should use CDL data, as it contains some interesting parameters, such as untraded kills, that might provide more predictive power. I also think it would be interesting to look at interactions between teams and maps, and to use those to aid in predicting a winner.

Final Thoughts

This project was a great opportunity to get hands-on with the R language and machine learning concepts. Working with Call of Duty data gave me a reason to keep going, and gave me an intrinsic pre-understanding of the data. There is still a lot to learn when it comes to machine learning, but as I learn it, I hope to apply it to my future research ideas. All in all, this project gave me a better understanding of a player’s statistics in Call of Duty, and it will give me one more thing to think about whenever I watch future competitive matches.